{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# `attention_mask`在处理多个序列时的作用\n",
    "\n",
    "现在我们训练和预测基本都是批量化处理的，而前面展示的例子很多都是单条数据。单条数据跟多条数据有一些需要注意的地方。\n",
    "\n",
    "## 处理单个序列\n",
    "\n",
    "我们首先加载一个在情感分类上微调过的模型，来进行我们的实验（注意，这里我们就不能能使用`AutoModel`，而应该使用`AutoModelFor*`这种带Head的model）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pprint import pprint as print  # 这个pprint能让打印的格式更好看一点\n",
    "from transformers import AutoModelForSequenceClassification, AutoTokenizer\n",
    "checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'\n",
    "tokenizer = AutoTokenizer.from_pretrained(checkpoint)\n",
    "model = AutoModelForSequenceClassification.from_pretrained(checkpoint)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对一个句子，使用tokenizer进行处理："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]]),\n",
      " 'input_ids': tensor([[ 101, 2651, 2003, 1037, 3835, 2154,  999,  102]])}\n"
     ]
    }
   ],
   "source": [
    "s = 'Today is a nice day!'\n",
    "inputs = tokenizer(s, return_tensors='pt')\n",
    "print(inputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看到，这里的inputs包含了两个部分：`input_ids`和`attention_mask`.\n",
    "\n",
    "模型可以直接接受`input_ids`并输出："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[-4.3232,  4.6906]], grad_fn=<AddmmBackward>)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model(inputs.input_ids).logits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "也可以通过`**inputs`同时接受`inputs`所有的属性："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[-4.3232,  4.6906]], grad_fn=<AddmmBackward>)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model(**inputs).logits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "上面两种方式的**结果是一样的**。\n",
    "\n",
    "## 但是当我们需要同时处理**多个序列**时，情况就有变了！"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],\n",
      "        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),\n",
      " 'input_ids': tensor([[  101,  2651,  2003,  1037,  3835,  2154,   999,   102,     0,     0,\n",
      "             0],\n",
      "        [  101,  2021,  2054,  2055,  4826,  1029, 10047,  2025,  2469,  1012,\n",
      "           102]])}\n"
     ]
    }
   ],
   "source": [
    "ss = ['Today is a nice day!',\n",
    "      'But what about tomorrow? Im not sure.']\n",
    "inputs = tokenizer(ss, padding=True, return_tensors='pt')\n",
    "print(inputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后，我们试着直接把这里的`input_ids`喂给模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[-4.1957,  4.5675],\n",
       "        [ 3.9803, -3.2120]], grad_fn=<AddmmBackward>)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model(inputs.input_ids).logits  # 第一个句子原本的logits是 [-4.3232,  4.6906]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "发现，第一个句子的`logits`变了！\n",
    "\n",
    "这是**因为在padding之后，第一个句子的encoding变了，多了很多0， 而self-attention会attend到所有的index的值，因此结果就变了**。\n",
    "\n",
    "这时，就需要我们不仅仅是传入`input_ids`，还需要给出`attention_mask`，这样模型就会在attention的时候，不去attend被mask掉的部分。\n",
    "\n",
    "\n",
    "因此，在处理多个序列的时候，正确的做法是直接把tokenizer处理好的结果，整个输入到模型中，即直接`**inputs`。\n",
    "通过`**inputs`，我们实际上就把`attention_mask`也传进去了:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[-4.3232,  4.6906],\n",
       "        [ 3.9803, -3.2120]], grad_fn=<AddmmBackward>)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model(**inputs).logits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在第一个句子的结果，就跟前面单条处理时的一样了。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
